In December 2019, the coronavirus that causes COVID-19 was first identified in Wuhan, China. By March 11, 2020, the World Health Organization (WHO) had declared the COVID-19 outbreak a pandemic. In the months in between, major outbreaks occurred in Iran, South Korea, and Italy.
We know that COVID-19 spreads through respiratory droplets, such as those produced by coughing, sneezing, or speaking. But how quickly did the virus spread across the globe? And can we see any effect from country-wide policies, like shutdowns and quarantines?
Please note that information and data regarding COVID-19 are updated frequently. The data used in this project was pulled on March 17, 2020, and should not be considered the most up-to-date data available.
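The coronavirus dataset has the following fields:
date - The date of the summary
province - The province or state, when applicable
country - The country or region name
lat - Latitude point
long - Longitude point
type - The type of case (confirmed, death, or recovered)
cases - The number of daily cases (corresponding to the case type)
# Packages used throughout: dplyr and tidyr for data wrangling, plotly for plots, DT for tables
library(dplyr)
library(tidyr)
library(plotly)
library(DT)
# Assumption: the coronavirus data frame is provided by the coronavirus package
library(coronavirus)
data("coronavirus")
head(coronavirus)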
#> date province country lat long type cases
#> 1 2020-01-22 Afghanistan 33 65 confirmed 0
#> 2 2020-01-23 Afghanistan 33 65 confirmed 0
#> 3 2020-01-24 Afghanistan 33 65 confirmed 0
#> 4 2020-01-25 Afghanistan 33 65 confirmed 0
#> 5 2020-01-26 Afghanistan 33 65 confirmed 0
#> 6 2020-01-27 Afghanistan 33 65 confirmed 0
str(coronavirus)
#> 'data.frame': 87808 obs. of 7 variables:
#> $ date : Date, format: "2020-01-22" "2020-01-23" ...
#> $ province: chr "" "" "" "" ...
#> $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#> $ lat : num 33 33 33 33 33 33 33 33 33 33 ...
#> $ long : num 65 65 65 65 65 65 65 65 65 65 ...
#> $ type : chr "confirmed" "confirmed" "confirmed" "confirmed" ...
#> $ cases : int 0 0 0 0 0 0 0 0 0 0 ...
total_cases <- coronavirus %>%
  group_by(type) %>%
  summarise(cases = sum(cases)) %>%
  mutate(type = factor(type, levels = c("confirmed", "death", "recovered")))
total_cases
#> # A tibble: 3 x 2
#> type cases
#> <fct> <int>
#> 1 confirmed 4261747
#> 2 death 291942
#> 3 recovered 1493414
The following plot shows the distribution of cases (active, recovered, and death) over time:
coronavirus %>%
  group_by(type, date) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(date) %>%
  mutate(active = confirmed - death - recovered) %>%
  mutate(active_total = cumsum(active),
         recovered_total = cumsum(recovered),
         death_total = cumsum(death)) %>%
  plot_ly(x = ~ date,
          y = ~ active_total,
          name = 'Active',
          fillcolor = '#1f77b4',
          type = 'scatter',
          mode = 'none',
          stackgroup = 'one') %>%
  add_trace(y = ~ death_total,
            name = "Death",
            fillcolor = '#E41317') %>%
  add_trace(y = ~recovered_total,
            name = 'Recovered',
            fillcolor = 'forestgreen') %>%
  layout(title = "Distribution of Covid19 Cases Worldwide",
         legend = list(x = 0.1, y = 0.9),
         yaxis = list(title = "Number of Cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))
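To make the arithmetic behind the plot easier to follow, here is a minimal sketch on a made-up three-day data frame. The name toy and its numbers are hypothetical and only mimic the long format of the coronavirus data. The sketch derives the daily change in active cases and then accumulates each series with cumsum, as the pipeline above does before plotting:

```r
library(dplyr)
library(tidyr)

# Hypothetical daily counts (not real data), one row per date and case type
toy <- data.frame(
  date  = rep(as.Date("2020-01-22") + 0:2, each = 3),
  type  = rep(c("confirmed", "death", "recovered"), times = 3),
  cases = c(10, 0, 0,
             5, 1, 2,
             8, 0, 4)
)

toy_totals <- toy %>%
  # One row per date, one column per case type
  pivot_wider(names_from = type, values_from = cases) %>%
  arrange(date) %>%
  # Daily change in active cases, then running totals for each series
  mutate(active = confirmed - death - recovered,
         active_total = cumsum(active),
         recovered_total = cumsum(recovered),
         death_total = cumsum(death))

toy_totals
```

On the last day, the three running totals add up to the cumulative number of confirmed cases, which is why the stacked areas in the plot sum to the confirmed total.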
The next table provides an overview of the ten countries with the most confirmed cases. We will use the datatable function from the DT package to view the table:
confirmed_country <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  mutate(perc = total_cases / sum(total_cases)) %>%
  arrange(-total_cases)
confirmed_country %>%
  head(10) %>%
  datatable(rownames = FALSE,
            colnames = c("Country", "Cases", "Perc of Total")) %>%
  formatPercentage("perc", 2)
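The perc column relies on a detail of dplyr that is easy to miss: after summarise collapses the single grouping variable, the result is no longer grouped, so sum(total_cases) inside mutate runs over all countries rather than within each one. A small sketch with hypothetical countries and counts (toy_cases is made up for illustration) shows the effect:

```r
library(dplyr)

# Hypothetical confirmed-case rows (not real data)
toy_cases <- data.frame(
  country = c("A", "A", "B", "C"),
  cases   = c(30, 20, 30, 20)
)

shares <- toy_cases %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  # The result is ungrouped here, so the denominator is the overall total
  mutate(perc = total_cases / sum(total_cases)) %>%
  arrange(-total_cases)

shares
```

If the shares came out as 1 for every country, that would be a sign that the data was still grouped when the percentage was computed.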
The next plot summarizes the distribution of confirmed cases by country:
conf_df <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases) %>%
  mutate(parents = "Confirmed") %>%
  ungroup()
plot_ly(data = conf_df,
        type = "treemap",
        values = ~total_cases,
        labels = ~ country,
        parents = ~parents,
        domain = list(column = 0),
        name = "Confirmed",
        textinfo = "label+value+percent parent")
Similarly, we can use the pivot_wider function from the tidyr package (in addition to the dplyr functions we used above) to get an overview of the three types of cases (confirmed, recovered, and death). We then use it to derive the recovery and death rates by country. Because most countries do not yet have enough information about the outcomes of their confirmed cases, we filter the data to countries with at least 25 confirmed cases:
coronavirus %>%
  filter(country != "Others") %>%
  group_by(country, type) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(- confirmed) %>%
  filter(confirmed >= 25) %>%
  # Derive both rates; the column labels below must match the columns in order
  mutate(recover_rate = recovered / confirmed,
         death_rate = death / confirmed) %>%
  datatable(rownames = FALSE,
            colnames = c("Country", "Confirmed", "Death", "Recovered",
                         "Recovery Rate", "Death Rate")) %>%
  formatPercentage("recover_rate", 2) %>%
  formatPercentage("death_rate", 2)
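As a rough illustration of the rate calculation and the 25-case threshold, the following sketch uses two hypothetical countries (toy_by_type and its numbers are invented for the example). Country B falls below the threshold and is dropped before any rates are computed:

```r
library(dplyr)
library(tidyr)

# Hypothetical totals per country and case type (not real data)
toy_by_type <- data.frame(
  country = rep(c("A", "B"), each = 3),
  type    = rep(c("confirmed", "death", "recovered"), times = 2),
  cases   = c(200, 10, 50,
               20,  1,  5)
)

rates <- toy_by_type %>%
  # One row per country, one column per case type
  pivot_wider(names_from = type, values_from = cases) %>%
  # Keep only countries with at least 25 confirmed cases
  filter(confirmed >= 25) %>%
  mutate(recover_rate = recovered / confirmed,
         death_rate = death / confirmed)

rates
```

Both rates are simple ratios of cumulative counts, so they say nothing about cases that are still unresolved.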
Note that it would be misleading to draw any conclusions about the recovery and death rates, as there is no detailed information about:
The following plot describes the overall distribution of the total confirmed cases in China by province:
coronavirus %>%
  filter(country == "China",
         type == "confirmed") %>%
  group_by(province, type) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(- confirmed) %>%
  plot_ly(labels = ~ province,
          values = ~confirmed,
          type = 'pie',
          textposition = 'inside',
          textinfo = 'label+percent',
          insidetextfont = list(color = '#FFFFFF'),
          hoverinfo = 'text',
          text = ~ paste(province, "<br />",
                         "Number of confirmed cases: ", confirmed, sep = "")) %>%
  layout(title = "Total China Confirmed Cases Dist. by Province")